3  Working With Data

3.1 Checking variable types

Once a data frame has been read into R, it is always a good idea to examine its contents using the str() function to see the structure of the data object. We have already seen the str() function in Lab 2 but as a reminder, it shows us the type of vector each column in a data frame is saved as.

Running the following code tells us that the four variables ldl, hdl, trig and age are all integer vectors and that id, gender and smoke are character vectors.

str(chol)
'data.frame':   13 obs. of  7 variables:
 $ id    : chr  "P912" "P215" "P063" "P117" ...
 $ ldl   : int  175 196 139 162 140 147 82 165 149 95 ...
 $ hdl   : int  25 36 65 37 117 51 81 63 49 54 ...
 $ trig  : int  148 92 NA 139 59 126 NA 120 NA 157 ...
 $ age   : int  39 32 42 30 42 65 57 48 32 55 ...
 $ gender: chr  "female" "female" "male" "female" ...
 $ smoke : chr  "no" "no" NA "ex-smoker" ...

Because gender only takes the values "female" or "male" in this case, and smoke is categorised into three levels, "no", "ex-smoker" and "current", it makes sense to treat both these variables as factors instead of character vectors.

We can use what we learned in Lab 2 to change these variables into factors.

chol$gender <- factor(x = chol$gender, levels = c("female", "male"))
chol$smoke <- factor(x = chol$smoke, levels = c("no", "ex-smoker", "current"))

Now using str() to check the type of vector each column is saved as shows us that gender and smoke are both now factors.

str(chol)
'data.frame':   13 obs. of  7 variables:
 $ id    : chr  "P912" "P215" "P063" "P117" ...
 $ ldl   : int  175 196 139 162 140 147 82 165 149 95 ...
 $ hdl   : int  25 36 65 37 117 51 81 63 49 54 ...
 $ trig  : int  148 92 NA 139 59 126 NA 120 NA 157 ...
 $ age   : int  39 32 42 30 42 65 57 48 32 55 ...
 $ gender: Factor w/ 2 levels "female","male": 1 1 2 1 1 1 2 2 1 1 ...
 $ smoke : Factor w/ 3 levels "no","ex-smoker",..: 1 1 NA 2 2 2 1 3 1 2 ...

Task

What type of variable is schools saved as in the education data frame?

Using the str() function shows us that schools is saved as an integer variable.

str(education)
'data.frame':   21 obs. of  5 variables:
 $ year    : int  2016 2016 2016 2017 2017 2017 2018 2018 2018 2019 ...
 $ level   : chr  "ELC" "Primary" "Secondary" "ELC" ...
 $ schools : int  2514 2031 359 2532 2019 360 2544 2012 357 2576 ...
 $ teachers: int  985 23920 22957 921 24477 23150 821 NA 23317 798 ...
 $ pupils  : int  96961 396697 280983 95893 400312 281993 96549 400276 286152 96375 ...

Task

Complete the code to change the variables year and level in education to be factor variables.

education$year <-(x =,
    levels =)

education$level <-(x =,
    levels = c())
education$year <- factor(x = education$year,
                         levels = 2016:2022)

education$level <- factor(x = education$level,
                          levels = c("ELC", "Primary", "Secondary"))

Refer to Section 1.11 Working with Data of Probability and Statistics with R to learn more about checking the setup of a data set.

3.2 Dealing with NA values

Data sets will often have missing values for a variety of different reasons; maybe because of human error, maybe because information was not disclosed or maybe because of a failed experiment for example. When data is correctly read into R these unknown values will be denoted by NA. In order to conduct analysis or perform calculations on your data, you may wish to remove these missing values from your data set. Always think about whether this is an appropriate thing to do.

One way in which we can remove missing values from a data set is to use the function na.omit(). This will return the data frame with any ‘incomplete cases’ removed. That is, any rows that have NA as the value for any variable will be removed from the data frame.

Looking at chol, we can see that there are missing values in rows 3, 7 and 9.

chol
        id ldl hdl trig age gender     smoke
1   1 P912 175  25  148  39 female        no
2   2 P215 196  36   92  32 female        no
3   3 P063 139  65   NA  42   male      <NA>
4   4 P117 162  37  139  30 female ex-smoker
5   5 P613 140 117   59  42 female ex-smoker
6   6 P332 147  51  126  65 female ex-smoker
7   7 P951  82  81   NA  57   male        no
8   8 P004 165  63  120  48   male   current
9   9 P725 149  49   NA  32 female        no
10 10 P901  95  54  157  55 female ex-smoker
11 11 P103 169  59   67  48 female        no
12 12 P843 174 117  168  41 female        no
13 13 P753  91  52  146  69 female   current

If we run the following code, then these rows are removed from the data frame and we are left with only the ‘complete cases’.

na.omit(chol)
     id ldl hdl trig age gender     smoke
1  P912 175  25  148  39 female        no
2  P215 196  36   92  32 female        no
4  P117 162  37  139  30 female ex-smoker
5  P613 140 117   59  42 female ex-smoker
6  P332 147  51  126  65 female ex-smoker
8  P004 165  63  120  48   male   current
10 P901  95  54  157  55 female ex-smoker
11 P103 169  59   67  48 female        no
12 P843 174 117  168  41 female        no
13 P753  91  52  146  69 female   current

Note that na.omit() preserves the original row labels. This means that there are no rows labelled 3, 7 or 9 in the resulting data frame because they have been completely removed.

complete.cases() is another useful function that can be used to remove rows that have NA values. This returns a logical vector, the same length as the number of rows of the data frame, that indicates whether a row contains any NA values (FALSE), or whether it is ‘complete’ (TRUE).

complete.cases(chol)
 [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
[13]  TRUE

Again we can see that the rows with missing values in chol are rows 3, 7 and 9 (since the third, seventh and ninth values in the output above are all FALSE). We can then use this logical vector to extract the rows which are complete from chol.

chol[complete.cases(chol), ]
     id ldl hdl trig age gender     smoke
1  P912 175  25  148  39 female        no
2  P215 196  36   92  32 female        no
4  P117 162  37  139  30 female ex-smoker
5  P613 140 117   59  42 female ex-smoker
6  P332 147  51  126  65 female ex-smoker
8  P004 165  63  120  48   male   current
10 P901  95  54  157  55 female ex-smoker
11 P103 169  59   67  48 female        no
12 P843 174 117  168  41 female        no
13 P753  91  52  146  69 female   current

Here, using na.omit() and complete.cases() have returned the same output.

Task

Which rows in education have missing values?

Using complete.cases() shows us that rows 8 and 16 of education are incomplete and therefore contain NA values.

complete.cases(education)
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

Task

Write code to remove all rows in education which contain NA values.

Incomplete observations can be removed from education using any of the following lines of code.

na.omit(education)
education[complete.cases(education), ]
education[!is.na(education$teachers), ]

In the case where we only want to know which entries of a vector or specific variable in a data frame are NA, we can use the function is.na(). For example, if missing values in the trig variable were not of concern but we wanted to identify missing values in the smoke column, we could use the following code.

is.na(chol$smoke)
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE

We can see that only the third row has the value NA for smoke, since the third element in the output from is.na() above is TRUE. In order to remove the row where smoke has a missing value, we can use the following code to index the chol data frame.

chol[!is.na(chol$smoke), ]
     id ldl hdl trig age gender     smoke
1  P912 175  25  148  39 female        no
2  P215 196  36   92  32 female        no
4  P117 162  37  139  30 female ex-smoker
5  P613 140 117   59  42 female ex-smoker
6  P332 147  51  126  65 female ex-smoker
7  P951  82  81   NA  57   male        no
8  P004 165  63  120  48   male   current
9  P725 149  49   NA  32 female        no
10 P901  95  54  157  55 female ex-smoker
11 P103 169  59   67  48 female        no
12 P843 174 117  168  41 female        no
13 P753  91  52  146  69 female   current

Note that we use ! in front of is.na() so that the logical vector returned has the value TRUE when values are complete and FALSE when values are missing i.e. NA.


You can look at further examples of dealing with missing data in Section 1.11.1 Dealing with NA Values of Probability and Statistics with R.

3.3 Sorting data frames

When investigating your data sets, you may want to order the values of a particular variable in increasing or decreasing order. This is easily done using the sort() function.

For example, we can view the ages of all subjects in chol, in increasing order, using the code below.

sort(chol$age)
 [1] 30 32 32 39 41 42 42 48 48 55 57 65 69

Note that if we wanted to view these ages in decreasing order, we would add the argument decreasing = TRUE to the sort() function.

Task

What is the largest value for pupils from the education data frame?

sort(education$pupils, decreasing = TRUE)[1]
[1] 400312

The downside of using sort() is that we can only see the values from one variable of a data frame. If instead we wanted to order all subjects in chol from the youngest to the oldest and still see the values of all the other variables, we can use the function order().

order() will return a vector showing which row has the smallest value, then the second smallest value and so on. For example, the following code shows us that the fourth subject in chol is the youngest and the thirteenth subject is the eldest.

order(chol$age)
 [1]  4  2  9  1 12  3  5  8 11 10  7  6 13

We can then use this vector to index the full data frame chol and see all the variables for each subject at once.

chol[order(chol$age), ]
     id ldl hdl trig age gender     smoke
4  P117 162  37  139  30 female ex-smoker
2  P215 196  36   92  32 female        no
9  P725 149  49   NA  32 female        no
1  P912 175  25  148  39 female        no
12 P843 174 117  168  41 female        no
3  P063 139  65   NA  42   male      <NA>
5  P613 140 117   59  42 female ex-smoker
8  P004 165  63  120  48   male   current
11 P103 169  59   67  48 female        no
10 P901  95  54  157  55 female ex-smoker
7  P951  82  81   NA  57   male        no
6  P332 147  51  126  65 female ex-smoker
13 P753  91  52  146  69 female   current

In the output above, note that there are multiple subjects aged 32, 42 and 48. After ordering by age, R automatically shows these subjects with the same age in order of increasing row number. We could however add a second or third argument to order() to order the rows by another variable in the case where there are repeated values of the first variable.

For example, the following code orders all the subjects in chol by age first, and then for any subjects that are the same age, they will then be sorted in order of increasing ldl.

chol[order(chol$age, chol$ldl), ]
     id ldl hdl trig age gender     smoke
4  P117 162  37  139  30 female ex-smoker
9  P725 149  49   NA  32 female        no
2  P215 196  36   92  32 female        no
1  P912 175  25  148  39 female        no
12 P843 174 117  168  41 female        no
3  P063 139  65   NA  42   male      <NA>
5  P613 140 117   59  42 female ex-smoker
8  P004 165  63  120  48   male   current
11 P103 169  59   67  48 female        no
10 P901  95  54  157  55 female ex-smoker
7  P951  82  81   NA  57   male        no
6  P332 147  51  126  65 female ex-smoker
13 P753  91  52  146  69 female   current

Task

Write code to sort the observations from education in decreasing order of the number of pupils.

We need to include the argument decreasing = TRUE within the function order() so that the observations are ordered from largest number of pupils to the smallest number of pupils. We can use the order() function within square brackets to show all variables in the data frame in order of decreasing number of pupils.

education[order(education$pupils, decreasing = TRUE), ]
   year     level schools teachers pupils
5  2017   Primary    2019    24477 400312
8  2018   Primary    2012       NA 400276
11 2019   Primary    2004    25027 398794
2  2016   Primary    2031    23920 396697
14 2020   Primary    2005    25651 393957
17 2021   Primary    2001    25807 390313
20 2022   Primary    1994    25451 388920
21 2022 Secondary     358    24874 309133
18 2021 Secondary     357    24782 306811
15 2020 Secondary     357    24077 300954
12 2019 Secondary     358    23522 292063
9  2018 Secondary     357    23317 286152
6  2017 Secondary     360    23150 281993
3  2016 Secondary     359    22957 280983
1  2016       ELC    2514      985  96961
7  2018       ELC    2544      821  96549
10 2019       ELC    2576      798  96375
4  2017       ELC    2532      921  95893
19 2022       ELC    2606      734  92615
16 2021       ELC    2630       NA  91603
13 2020       ELC    2587      729  90126

Look at Section 1.11.3 Sorting a Data Frame by One or More of Its Columns of Probability and Statistics with R to learn more about sorting and ordering data sets.

3.4 Subsetting

When we want to only view particular elements of a data frame, this is known as subsetting the data. This is useful if you’re dealing with extremely large data sets and only want to analyse female subjects, or subjects who are all from the same country for example. Subsetting the data means that you would extract only these subjects that you are actually interested in.

A useful function for extracting elements of a data frame is the function subset() (which we first saw in Lab 1). This allows us to extract the elements of a data frame which meet particular conditions. The arguments that subset() takes are:

  • x =: this is the data frame that we want to extract particular elements from.
  • subset =: this is a logical statement which determines the elements to keep in the subsetted data frame.
  • select =: this shows the column or columns from the data frame which the logical statement should be applied to.

For example, if we wanted to view the subjects in chol who have an LDL of greater than 170, then we can use the following code.

subset(x = chol, subset = ldl > 170, select = ldl)
   ldl
1  175
2  196
12 174

This shows us that there are three patients with LDL greater than 170 (subset = ldl > 170) and we can also see the values of LDL for these patients (select = ldl).

If we wanted to see the values of the other variables in the data frame for only those patients with LDL greater than 170, then we can simply leave out the select = argument.

subset(x = chol, subset = ldl > 170)
     id ldl hdl trig age gender smoke
1  P912 175  25  148  39 female    no
2  P215 196  36   92  32 female    no
12 P843 174 117  168  41 female    no

Note that it is also possible to subset a data frame using logical statements within square brackets, [ ]. We could return the same output as above by indexing the chol using the following code.

chol[chol$ldl > 170, ]
     id ldl hdl trig age gender smoke
1  P912 175  25  148  39 female    no
2  P215 196  36   92  32 female    no
12 P843 174 117  168  41 female    no

Task

Write some code to subset education to show the number of schools that have a collective total of more than 310,000 pupils in the years 2020, 2021 or 2022.

The data frame that we want to subset is education, so this is what we’ll feed in to the argument x =.

Since the question asks us to look for a collective total of more than 310,000 pupils, this means we want to only see the rows where the value for pupils is greater than 310,000. We also only want to see rows from the years 2020, 2021 or 2022. Because year is a factor, we need to specify each level that we are interested in. This means that we are looking for rows in which pupils > 310000 AND year == "2020" or year == "2021" or year == "2021". This is quite a lengthy logical statement in the following code.

The question also asks us to only show the number of schools for which these statements are true i.e. the column schools. To do this, we simply feed this variable to the select = argument.

subset(x = education,
       subset = pupils > 310000 & year == "2020" | 
         pupils > 310000 & year == "2021" | 
         pupils > 310000 & year == "2022",
       select = schools)

A way we can shorten the logical statement in the subset = argument is to use the operator %in%. This will search for values in a vector and return the rows in which any of these values appear.

subset(x = education,
       subset = pupils > 310000 & year %in% c("2020", "2021", "2022"),
       select = schools)
   schools
14    2005
17    2001
20    1994

You can read more about subsetting data frames in Section 1.12 Using Logical Operators with Data Frames in Probability and Statistics with R.

3.5 Summarising data

Data sets will often contain a lot of information which is not easy to interpret at a glance. It is therefore useful to be able to summarise the data they contain, in appropriate ways for each different type of variable.

One of the simplest functions to help summarise a data frame is the summary() function.

summary(chol)
      id                 ldl             hdl           trig      
 Length:13          Min.   : 82.0   Min.   : 25   Min.   : 59.0  
 Class :character   1st Qu.:139.0   1st Qu.: 49   1st Qu.: 99.0  
 Mode  :character   Median :149.0   Median : 54   Median :132.5  
                    Mean   :144.9   Mean   : 62   Mean   :122.2  
                    3rd Qu.:169.0   3rd Qu.: 65   3rd Qu.:147.5  
                    Max.   :196.0   Max.   :117   Max.   :168.0  
                                                  NA's   :3      
      age           gender         smoke  
 Min.   :30.00   female:10   no       :6  
 1st Qu.:39.00   male  : 3   ex-smoker:4  
 Median :42.00               current  :2  
 Mean   :46.15               NA's     :1  
 3rd Qu.:55.00                            
 Max.   :69.00                            
                                          

The output from summary() shows information for each column in the data frame you provide as the argument. For numerical variables, we are shown summary statistics such as the minimum value, the mean or the 3rd quartile. For factor variables, we are shown how many observations there are in each level of the factor. If there are any NA values in a column, the total number of these will also be shown for each variable.

When a data frame contains categorical variables, a neater way to summarise the counts of the different levels is in contingency tables. These show counts of how many times each level of a categorical variable appeared in the data frame. The function to create contingency tables in R is table(). The only argument that table() needs is the factor variable you want to summarise.

For example, we can quickly show counts of how many subjects in chol fall into each of the three levels of the smoke variable using the following code.

table(chol$smoke)

       no ex-smoker   current 
        6         4         2 

If we wanted to further split these counts by the variable gender, then we simply add this as a second argument to the table() function.

smoke_counts <- table(chol$smoke, chol$gender)
smoke_counts
           
            female male
  no             5    1
  ex-smoker      4    0
  current        1    1

We can easily compute the sums of rows or columns in a table using the function margin.table(). Here we need to provide margin.table() with the following arguments:

  • x =: this is the table you want to sum over.
  • margin =: this tells R whether you want to sum over rows (set the value to 1), or columns (set the value to 2).

For example, we can use the table smoke_counts, created above, to count the number of female and male subjects for whom we know their smoking status, using margin.table().

margin.table(x = smoke_counts, margin = 2)

female   male 
    10      2 

Another useful function to use with tables is prop.table(). This takes the same arguments as margin.table() but shows row or column proportions, rather than sums.

For example, to calculate the proportions of current smokers, ex-smokers and non-smokers that are female and male, we can use the following code.

prop.table(x = smoke_counts, margin = 1)
           
               female      male
  no        0.8333333 0.1666667
  ex-smoker 1.0000000 0.0000000
  current   0.5000000 0.5000000

Suppose you wanted to calculate summary statistics for one variable in a data frame, but have it split by the levels of a different categorical variable.

The function in R which calculates a summary statistic for one numeric variable, split by the levels of a factor is tapply(). The arguments that tapply() can take are as follows:

  • X =: this is the numeric variable that you want to apply the function calculating some summary statistic to.
  • INDEX =: this is a list containing the categorical variable (or variables) you want to split the calculation of the summary statistic across.
  • FUN =: this is the name of the function you want to apply to the numeric variable. Examples include mean, median, max, min, mode, sd etc.

In the case where we are interested in knowing the mean HDL for subjects who were current smokers, subjects who were ex-smokers and subjects who were non-smokers, we can use tapply().

tapply(X = chol$hdl, INDEX = list(chol$smoke), FUN = mean)
       no ex-smoker   current 
 61.16667  64.75000  57.50000 

We can see, for example, that the mean HDL for non-smokers is 61.17.

The list provided to the INDEX = argument can contain more than one categorical variable. For example, we can calculate the mean HDL of females and males for each level of the smoke variable using the following code.

tapply(X = chol$hdl, INDEX = list(chol$smoke, chol$gender), FUN = mean)
          female male
no         57.20   81
ex-smoker  64.75   NA
current    52.00   63

Now we can see that the mean HDL for females who are non-smokers is 57.20. The mean HDL for males who are ex-smokers is NA because there are no males included in chol who are ex-smokers.

Task

What is the mean total number of teachers in primary schools across all years?

In order to find this value we want to use the function tapply(). teachers is the column we want to calculate the mean for, but make sure to split this by the different levels in the level column.

teachers contains some NA values, which when passed to the function mean will return another NA value unless you provide to tapply() the additional argument na.rm = TRUE. This tells R to ignore the NA values when calculating the mean and only use those rows which have a numerical value.

tapply(X = education$teachers, INDEX = list(education$level),
       FUN = mean, na.rm = TRUE)
       ELC    Primary  Secondary 
  831.3333 25055.5000 23811.2857 

To read more on creating tables and summarising data in R, see Sections 1.13 Tables and 1.14 Summarizing Functions in Probability and Statistics with R.

3.6 Creating variables

In the case where we have another vector or data frame that we wish to join to an existing one, we can do this using one of the functions cbind() or rbind().

  • cbind() combines the vectors or data frames together by making additional columns, whereas

  • rbind() combines them by adding the new vector or data frame as additional rows.

Let’s see an example to understand how this works. The file measurements.csv contains information on the heights and weights of all 13 patients in the original chol data frame. We can begin by reading it in to the Environment tab using the following code.

measurements <- read.csv(file = "measurements.csv")

We can then add measurements to chol as two additional columns and save the resulting data frame as chol_full using the code below.

chol_full <- cbind(chol, measurements)
head(chol_full)
    id ldl hdl trig age gender     smoke weight height
1 P912 175  25  148  39 female        no  90.77   1.69
2 P215 196  36   92  32 female        no  75.06   1.75
3 P063 139  65   NA  42   male      <NA>  73.99   1.84
4 P117 162  37  139  30 female ex-smoker  86.25   1.83
5 P613 140 117   59  42 female ex-smoker  76.95   1.81
6 P332 147  51  126  65 female ex-smoker  57.66   1.75

Another way to easily create a new variable in a data frame is to use the $ operator. We can simply add the name of the data frame to the left of $ and our new variable name to the right. Then we can set this variable to be any pre-existing vector, or calculate a new vector based on variables from the data frame.

For example, if we wanted to create a new variable, bmi, in chol which shows the BMI of each patient, then we can use the following code.

chol_full$bmi <- chol_full$weight/(chol_full$height)^2
head(chol_full)
    id ldl hdl trig age gender     smoke weight height      bmi
1 P912 175  25  148  39 female        no  90.77   1.69 31.78110
2 P215 196  36   92  32 female        no  75.06   1.75 24.50939
3 P063 139  65   NA  42   male      <NA>  73.99   1.84 21.85432
4 P117 162  37  139  30 female ex-smoker  86.25   1.83 25.75473
5 P613 140 117   59  42 female ex-smoker  76.95   1.81 23.48829
6 P332 147  51  126  65 female ex-smoker  57.66   1.75 18.82776

Task

In the education data frame, create a new variable called ratio which calculates the pupil to teacher ratio in each level of education. That is,

\[\mbox{ratio}=\textstyle\frac{\mbox{puils}}{\mbox{teachers}}\]

education$ratio <- education$pupils/education$teachers

Now suppose that information on a fourteenth subject is known but has not been included in the original chol data frame. This data is shown in Table 3.1 below.

Table 3.1: Cholesterol data for the fourteenth patient.
id ldl hdl trig age gender smoke weight height
P461 148 78 120 41 male current 84.05 1.79

In this case we can add the new subject as an additional row using the rbind() function.

First, we need to create a data frame containing the information for this subject. In order for us to add this data frame as a row to chol_full, it needs to have the same number of variables. Therefore, we also need to calculate the BMI for this subject and call it bmi. We can do all this with the following code.

subject <- data.frame(id = "P461", ldl = 148, hdl = 78, trig = 120, age = 41, 
                      gender = "male", smoke = "current", weight = 84.05,
                      height = 1.79)

subject$bmi <- subject$weight/(subject$height)^2

Now we can add this subject to chol_full using the code below.

chol_full <- rbind(chol_full, subject)
tail(chol_full)
     id ldl hdl trig age gender     smoke weight height      bmi
9  P725 149  49   NA  32 female        no  65.37   1.67 23.43935
10 P901  95  54  157  55 female ex-smoker  80.34   1.62 30.61271
11 P103 169  59   67  48 female        no  74.90   1.61 28.89549
12 P843 174 117  168  41 female        no  63.78   1.77 20.35813
13 P753  91  52  146  69 female   current  71.58   1.62 27.27481
14 P461 148  78  120  41   male   current  84.05   1.79 26.23202

Note that tail() is a function very similar to head(), but rather than showing the first 6 rows by default, it shows the last 6.


Sections 1.11.2 Creating New Variables in a Data Frame and 1.13 Tables of Probability and Statistics with R describe how to create new variables.

See Appendix A to learn how to create a new variable in a data frame by breaking an exisiting variable into different levels.